Gene Expression Data Classification with Revised Kernel Partial Least Squares Algorithm
Abstract
One important feature of gene expression data is that the number of genes M far exceeds the number of samples N. Standard statistical methods do not work well when N < M, so the analysis of microarray data requires new methodologies or modifications of existing ones. In this paper, we propose a novel analysis procedure for classifying gene expression data. The procedure involves dimension reduction using kernel partial least squares (KPLS), followed by classification with logistic regression (discrimination) and other standard machine learning methods. KPLS is a nonlinear generalization of partial least squares (PLS). The proposed algorithm was applied to five different gene expression datasets involving human tumor samples. Comparison with other popular classification methods, such as support vector machines and neural networks, shows that our algorithm is very promising for classifying gene expression data.
Introduction
One important application of gene expression data is the classification of samples into different categories, such as tumor types. Gene expression data are characterized by many variables measured on only a few observations. It has been observed that, although there are thousands of genes for each observation, a few underlying gene components may account for much of the variation in the data. PLS provides an efficient way to find these underlying gene components and reduce the input dimensions (Nguyen and Rocke 2002). PLS is a method for modeling a linear relationship between a set of output variables and a set of input variables, and it has been used extensively in chemometrics. In general, the structure of chemometric data is similar to that of microarray data: small samples and high dimensionality. With this type of input, linear least squares regression often fails, but linear PLS excels.
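The pipeline named in the abstract (kernel transformation, PLS dimension reduction, logistic classification) can be illustrated with a minimal sketch. This is not the authors' revised algorithm: it assumes the NIPALS-style kernel PLS commonly attributed to Rosipal and Trejo (2001), an RBF kernel with an arbitrary `gamma`, a two-class problem, and synthetic data with N = 40 samples and M = 200 features. All function names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, gamma):
    # Pairwise squared Euclidean distances between rows, then the Gaussian kernel.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def center_kernel(K):
    # Center the data in feature space: K <- H K H with H = I - (1/n) 11^T.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kpls_scores(K, y, n_components):
    # Assumed NIPALS-style kernel PLS for a single response column.
    # With one response, the inner iteration converges in a single pass,
    # so u is simply the normalized (deflated) response.
    n = K.shape[0]
    Kd = K.copy()
    yd = y.astype(float) - y.mean()
    T = np.zeros((n, n_components))
    for a in range(n_components):
        u = yd / np.linalg.norm(yd)
        t = Kd @ u
        t /= np.linalg.norm(t)
        T[:, a] = t
        D = np.eye(n) - np.outer(t, t)   # projector that removes t
        Kd = D @ Kd @ D                  # deflate the kernel matrix
        yd = yd - t * (t @ yd)           # deflate the response
    return T

def fit_logistic(Z, y, lr=0.5, n_iter=2000):
    # Plain gradient descent on the mean logistic loss, with an intercept.
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])
    w = np.zeros(Zb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Zb @ w))
        w -= lr * Zb.T @ (p - y) / len(y)
    return w

def predict_logistic(Z, w):
    Zb = np.hstack([np.ones((Z.shape[0], 1)), Z])
    return (Zb @ w > 0).astype(int)

def demo(seed=0):
    # Synthetic stand-in for expression data: N = 40 samples, M = 200 features,
    # with the first 50 features shifted in class 1 (an assumption for illustration).
    rng = np.random.default_rng(seed)
    n, m = 40, 200
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, m))
    X[y == 1, :50] += 2.0
    K = center_kernel(rbf_kernel(X, gamma=1.0 / m))
    T = kpls_scores(K, y, n_components=3)
    Z = T * np.sqrt(n)                   # rescale unit-norm scores for the optimizer
    w = fit_logistic(Z, y)
    acc = float(np.mean(predict_logistic(Z, w) == y))
    return T, acc

if __name__ == "__main__":
    T, acc = demo()
    print("score matrix shape:", T.shape, "training accuracy:", acc)
```

The sketch reports training accuracy only. Evaluating on held-out samples would additionally require projecting a centered test kernel onto the extracted components, which is omitted here for brevity.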
Rosipal and Trejo (2001) and Bennett and Embrechts (2003) extended PLS to nonlinear regression using kernel functions, mainly for the purpose of real-valued prediction. Nguyen and Rocke (2002) applied PLS/PCA, together with logistic discrimination, to classify tumor data and reported success with their approach. However, their procedure is linear and limited by its implementation in SAS. In this paper we propose a novel analysis procedure for classifying tumor samples using gene expression profiles. Our algorithm combines KPLS with logistic regression. The procedure involves three steps: feature space transformation, dimension reduction, and classification. The proposed algorithm has been applied to five popular gene expression datasets. One is a two-class recognition problem (AML versus ALL), and the other four involve multiple classes.
Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
Algorithm
A gene expression dataset with M genes (features) and N mRNA samples (observations) can be conveniently represented by the following gene expression matrix
Similar resources
Feature Selection and Classification of Microarray Gene Expression Data of Ovarian Carcinoma Patients using Weighted Voting Support Vector Machine
DNA microarray gene expression profiling gives access to a wealth of information on thousands of variables (genes). Analysis of this information can reveal the genetic causes of disease and of differences between tumors. In this study we try to reduce the high-dimensional data with a statistical method, selecting high-impact genes as biomarkers, and then classify ovarian tumors based on gene expression data of...
Kernel Partial Least Squares for Stationary Data
We consider the kernel partial least squares algorithm for non-parametric regression with stationary dependent data. Probabilistic convergence rates of the kernel partial least squares estimator to the true regression function are established under a source and an effective dimensionality condition. It is shown both theoretically and in simulations that long range dependence results in slower c...
Gene Function Prediction from Functional Association Networks Using Kernel Partial Least Squares Regression
With the growing availability of large-scale biological datasets, automated methods of extracting functionally meaningful information from this data are becoming increasingly important. Data relating to functional association between genes or proteins, such as co-expression or functional association, is often represented in terms of gene or protein networks. Several methods of predicting gene f...
Near-Infrared Spectroscopy Coupled with Kernel Partial Least Squares-Discriminant Analysis for Rapid Screening Water Containing Malathion
Near-infrared spectroscopy coupled with kernel partial least squares-discriminant analysis was used to rapidly screen water containing malathion. In the wavenumber range of 4348 cm⁻¹ to 9091 cm⁻¹, the overall correct classification rate of kernel partial least squares-discriminant analysis was 100% for the training set and 100% for the test set, with the lowest concentration detected malathion residues in water...
Multi-Kernel Partial Least Squares Regression Modeling based on Adaptive Genetic Algorithm
Soft sensor models based on kernel learning have been a focus of the machine learning domain. The kernel partial least squares (KPLS) algorithm can construct nonlinear models using latent variables extracted simultaneously from the input and output data spaces. However, the generalization of a KPLS model depends on the model's kernel type and kernel parameters for different modeling data. Thus, linear combin...